Abstract

This paper develops a novel approach to compute and search for
context-specific similarity relationships between HTML (Hypertext
Markup Language) resources. The similarity is computed by decomposing
resources into parts, finding similar parts among resources, and then
extracting the pattern of matched parts into a feature vector. This
general approach for resource-resource linking is termed part-linking
and its specialization to HTML resources is termed html-linking.
Given the vectors describing resource-resource associations, the
following search process can be employed: "Find all resources similar
to $r_s$ in the same manner that resource $r_d$ is similar to $r_s$."
A neighborhood search model is developed to execute and control this
style of request and find all resources in the network matching the
sample structure within well-defined windows of approximation. The
innovation of the html-linking model is its matching of textual
resources based on the locality and organizational structure of the
content found to be matching between resources.

1 Introduction 

The amount, variety, and distribution of online information 
is rapidly exploding with the advent of the World Wide Web 
(WWW) information space in the global Internet. Informa- 
tion resources change constantly, and users are faced with 
the daunting challenges of finding, navigating, collecting, 
evaluating, and processing in this dynamic information uni- 
verse. Intelligently integrating and correlating information 
has surfaced as a breakthrough topic governing the suc- 
cess of next-generation information networks such as the 
WWW. This topic is concerned with scalable methods and 
metaphors to abstract, integrate, fuse, or otherwise reduce 
and add value to massive amounts of distributed informa- 
tion. 

One mechanism for bringing such content-orientation to the web is to
"link" information resources to other semantically related resources
in the network. Such a value-added link service would monitor the
information space and attempt to link resources when their internal
content exhibits semantically important patterns.

*This research is supported in part by DARPA contracts
F30602-94-C0207 and N66001-97-0-8601.

*This author is now at MCC in Austin, TX; bperry@mcc.com

This paper introduces a content linking service based on answering
questions of the style, "Find all resources similar to $r_s$ in the
same manner that resource $r_d$ is similar to $r_s$." The content
association exhibited between $r_s$ and $r_d$ is taken as the
association type to search for from resource $r_s$ in the information
network. We introduce the concept of part-linking to identify and
evaluate the association between textual resources. The part-linking
concept is then specialized to apply to HTML resources, resulting in
the html-linking model for linking resources on the web. The
innovation of html-linking is that it computes the association
between resources based on the locality and organizational structure
of the content matching between two resources.

1.1 Background and Motivation 

The general mechanism of linking resources is commonly referred to as
hypertext and has become a fundamental tool for navigating and
organizing large-scale information collections [6, 13]. A pointer
from one resource to another is referred to as a "hypertext link" (or
just a "link"). A hypertext link is, at its most basic level, a
connection between two units of text from two resources. Without any
further information describing the link, it serves only as a casual
suggestion that relevant information may exist at the other end of
the link. A link type is an attribute associated with each hypertext
link that gives some idea of the effect, or semantic intent, of
following the link [18]. Typed links are essential for providing
contextual focus to navigation in large document collections. This is
becoming apparent in the web community, one of the largest testbeds
for hypertext developments, where the WWW consortium is actively
considering the class of link types that will be incorporated into
the next versions of the HTML hypertext standard [15, 14]. Given the
size and diversity of large-scale information networks, it has become
important to explore techniques for automatically (or
semi-automatically) implanting typed links over the information space
[5, 1, 16, 7, 8, 12, 17, 4, 10]. Within the auspices of the WWW,
these techniques fall into two general categories: link discovery and
link annotation.

In link discovery methods [5, 1, 16, 7, 8, 12], network resources are
manipulated as individual documents and the goal is to establish
semantic links between semantically related documents. In this
respect, the links are discovered and maintained "above" the document
space in separate link services.

The particular structure of the WWW and its dominant encoding
standard, the Hypertext Markup Language (HTML), presents
opportunities for some unique developments in resource linking
services. HTML currently supports a single typeless link that authors
can place inside their documents. The existence of such embedded
associations in the information space lends itself to a new type of
resource linking service - that of annotating the existing link base
with labels of the semantic intent, or association, between the
linked resources. In other words, HTML authors point to resources
related to their current documents and the link annotation services
[17, 4, 10] add types or context to these links with respect to how
they are used in and across the information space. In this respect,
the types associated with links are discovered and maintained "above"
the typeless HTML information space in separate link annotation
services.

In both the link discovery and link annotation approaches, the
central issue is that of how to compute and capture the semantic, or
content-sensitive, association between resources. Html-linking is the
novel resource association technique introduced in this paper.
Specifically, our html-linking model develops an innovative method
for capturing the association between HTML resources in a manner
sensitive to the locality and organizational structure of the content
matching between two resources. As a result, the html-linking methods
can be used as the resource comparison technique at the center of
both link discovery and link annotation approaches for bringing
content-orientation to the web.

2 Linking Resources by Part Patterns 

Html-linking is developed by introducing a concept of resource
part-linking and then specializing this concept to apply to HTML
resources. The idea behind part-linking is the "part comparison"
technique originally identified in [1, 2]:

For the analysis of resource-resource relationships, 
resources must be broken into parts. Depending 
on the nature of the collection a part may be 
a paragraph, sentence group, sentence, or other 
sequence of text. The set of part-pairs exhibit- 
ing "sufficient" similarity are analyzed and the 
general pattern defined by these pairs determines 
the semantic association assigned between the re- 
sources. 

Our work concretizes this idea with the part-linking process depicted
in Figure 1 (see footnote 1). This figure shows two resources $r_i$
and $r_j$ where each resource is decomposed into a set of parts
($p_{ix}$, $p_{jy}$) and a global content measure ($g_i$, $g_j$). The
two resources are then "centered" and a line is drawn between each
pair of parts upholding a "sufficient" degree of similarity.
Furthermore, a line is drawn between the global content measures if
they uphold a "sufficient" degree of similarity. Finally, the
semantic association assigned to $r_i \to r_j$ is a function of (1)
the degree of global ($g_i$, $g_j$) match; (2) the common content
required for parts to match; and (3) the pattern of the lines
connecting matched parts. The third criterion is the innovation of
the part-linking work and remains an open area for further
exploration. The pattern will be some function of the number of parts
matched, the space in between matching parts, the crossings of the
lines drawn between the matched parts, or other features that can be
extracted from the structure shown in Figure 3.

1. The work in [1] followed a significantly different approach for
applying this general "part decomposition and analysis" idea.



Figure 1: Part Decomposition and Matching 



3 The HTML-Linking Model 

Html-linking captures the part-linking pattern between resources as a
6-place real-valued association vector ($A$):

$$\mathrm{assoc}(r_s, r_d, m_k) \mapsto A_{sd} = (c_{sd}, d_{sd}, u_{sd}, s_{sd}, f_{sd}, l_{sd})$$
$$c_{sd}, d_{sd}, u_{sd}, s_{sd}, f_{sd}, l_{sd} \in [0 \ldots 1]$$

where $r_s$ is the target resource, $r_d$ is the resource we wish to
associate with $r_s$, and $m_k \in [0 \ldots 1]$ is the minimum
similarity required for parts from $r_s$ and $r_d$ to be considered
matching. Given a target resource $r_s$, a specific content match
threshold $m_k$, and an association context $A$, exact html-linking
discovers the set $R_d$ of all resources in the network such that
each resource $r_d \in R_d$ upholds
"$\mathrm{assoc}(r_s, r_d, m_k) = A$."

To extend to approximate matching, a neighborhood relaxation window
is introduced. Given two association vectors, $A_{ij}$ and $A_{xy}$,
the window function evaluates to true if $A_{ij}$ is in the
$w_k \in [0 \ldots 1]$ sized window around $A_{xy}$:

$$\mathrm{window}(w_k, A_{ij}, A_{xy}) \in \{\text{true}, \text{false}\}$$

Using the assoc and window functions, html-linking can be defined as
a service taking four inputs and finding all resources $r_d$ in the
network such that:

$$\begin{aligned}
&\text{html-linking}(r_s, A, m_k, w_k) \mapsto R_d \\
&\text{such that } \forall r_d \in R_d: \\
&\qquad \mathrm{assoc}(r_s, r_d, m_k) = A_{sd} \text{ and} \\
&\qquad \mathrm{window}(w_k, A_{sd}, A) = \text{true}
\end{aligned} \qquad (1)$$

A target request such as, "Find all resources similar to $r_s$ in the
same manner that $r_d$ is similar to $r_s$" will generate an
html-linking request with the target association context as
$A = \mathrm{assoc}(r_s, r_d, m_k)$. The remainder of this section
defines the constructs introduced in Equation 1 and demonstrates how
html-linking performs a specialized part-linking search and match
over HTML resources.

3.1 Basis for HTML Resource Decomposition 
3.1.1 HTML Overview 

HTML is a WWW standard [15] for annotating "raw" text documents with
markers that add information to the chunk of text contained within
each individual marker. The purpose of the markup commands is to
attach coherent categories to clusters of text and separate the
otherwise sequential document into logical elements. Figure 2 shows a
portion of an HTML document - each tag consists of a begin tag
marker, the text to be interpreted in this tag environment, and an
end tag marker. For html-linking purposes, the HTML tag set can be
considered to consist of two classes of tags:
1. Sectioning and organization tags group chunks of text into logical
elements. Examples of tags in this class are paragraph markers, list
markers, line breaks, section headings, titles, and tables.

2. Extrinsic tags associate chunks of text with information external
to the document itself. Examples of tags in this class are anchors,
image references, and Java applets. Of particular interest is the
"anchor" tag which associates a chunk of text in the document with
any URL-addressable external information object.

Figure 2: General Anatomy of an HTML Resource

We should be careful to identify the difference between our
html-linking process, a value-added external information service, and
the anchor tags internal to HTML resources. HTML anchors are typeless
links - every link has the same appearance regardless of the media
type at the other end of the link and regardless of the semantic
association intended by the resource author. The accuracy of each
link is constrained by the author's limited knowledge of what
resources exist throughout the entire network. In addition, every
user sees the exact same set of anchors in a document regardless of
the profile and goals of the user's information processing session.
Nevertheless, HTML anchors are an important metaphor for pointing
resources to related media objects. The caution is that the
"relation" should only and can only be taken to imply casual
association. Html-linking, on the other hand, is a service external
to the actual resources and based heavily on the typed semantic
association that can be inferred among resources. HTML anchors
provide casual and locally computed media associations in the web
whereas value-added services (such as html-linking) provide heavily
typed and far-reaching associations discovered amongst large
collections of resources.

3.1.2 Part Decomposition 

For HTML resources, parts are taken to be contiguous, de- 
lineated sentence groups. In order to identify parts, the doc- 
ument is searched for text sectioning and organization tags 
to use in extracting delineated sentence groups. In summary 
the following general decomposition was developed: 

• Table cells are aggregated and each table is processed 
as a single sentence group (i.e., part). 

• List elements are aggregated and each list is processed 
as a single sentence group. 

• Traditional paragraphs are kept together and processed 
as single sentence groups. 

• Titles and section headings are treated as independent 
parts. 



A single level of this decomposition is performed; that is, 
nested sectioning and organization tags do not create new 
parts although they are used to properly collect the text into 
contiguous chunks. For example, a resource using two lists 
with tables inside the elements of one list and nested lists in 
the other would result in two parts - those corresponding to 
the top-level lists themselves. Using the data sets described 
in Section 4, it was observed that, using this decomposition 
parser, the number of parts in the HTML resources ranged 
from 2 to 40 with an average of 14; and the number of terms 
in each part ranged from 4 to 78 with an average of 28. 

As a final step, each part generated by the above "decomposition
process" is annotated with the top-level HTML tag that caused the
part to be extracted. In other words, each part is assigned one of
the following labels: title, heading, table, list, or paragraph. This
labeling does not provide the deep semantic structuring performed in
more closed and domain-specific studies; however, it does provide a
portable structuring that can be applied to HTML documents in
general. The overall decomposition performed on HTML resources is
summarized as follows:

$$\mathrm{decomp}(r_i) = \{ (p_{ik}, l_k) \mid p_{ik} \text{ is a delineated text chunk} \}$$
$$p_{ik} = \text{terms in the } k\text{th part of resource } r_i$$
$$l_k \in \{\text{title}, \text{heading}, \text{table}, \text{list}, \text{paragraph}\}$$
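The single-level decomposition described above can be sketched in Python with the standard library's html.parser module. This is only an illustration under stated assumptions, not the parser used in the experiments: the class name PartDecomposer, the function decomp, the tag-to-label table, whitespace tokenization, the requirement that part tags are explicitly closed, and the choice to drop text outside any part tag are all hypothetical.

```python
from html.parser import HTMLParser

# Assumed mapping from top-level HTML tags to the five part labels.
PART_LABELS = {
    "title": "title",
    "h1": "heading", "h2": "heading", "h3": "heading",
    "h4": "heading", "h5": "heading", "h6": "heading",
    "table": "table",
    "ul": "list", "ol": "list", "dl": "list",
    "p": "paragraph",
}


class PartDecomposer(HTMLParser):
    """Collects (terms, label) pairs using one level of decomposition."""

    def __init__(self):
        super().__init__()
        self.parts = []       # finished (terms, label) pairs
        self.open_tag = None  # tag that opened the current part
        self.depth = 0        # nesting depth of that same tag
        self.terms = []       # terms collected for the current part

    def handle_starttag(self, tag, attrs):
        if self.open_tag is None and tag in PART_LABELS:
            # A top-level sectioning tag starts a new part.
            self.open_tag, self.depth, self.terms = tag, 1, []
        elif tag == self.open_tag:
            # Same tag nested inside itself: track depth, no new part.
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.open_tag:
            self.depth -= 1
            if self.depth == 0:
                if self.terms:
                    label = PART_LABELS[self.open_tag]
                    self.parts.append((self.terms, label))
                self.open_tag = None

    def handle_data(self, data):
        # Nested tags do not create parts, but their text is collected.
        if self.open_tag is not None:
            self.terms.extend(data.lower().split())


def decomp(html_text):
    """Return the list of (terms, label) parts for one HTML resource."""
    parser = PartDecomposer()
    parser.feed(html_text)
    parser.close()
    return parser.parts
```

A list containing a nested list or table yields a single part labeled list, matching the single-level rule stated in the text.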

3.1.3 Part Similarity 

When performing part-linking, the parts from each resource are
compared to see if they match with "sufficient" similarity. Thus, a
similarity function is assumed that, given two parts $p_i$ and $p_j$,
computes a number in the range [0 ... 1]. This function returns 1.0
if the parts are an exact match and 0.0 if they have no content in
common. The emphasis of this work is not to develop the "correct"
function for comparing resource parts but, rather, to allow any part
similarity function to be plugged into the part-linking process.
Given that two parts are simply a set of terms and a structure label,
the following similarity function was used in the experiments with
html-linking:

• If the parts have different labels, then their similarity is 0.0.

• If the parts have the same label, then their similarity is computed
as the ratio of the number of terms they have in common to the total
number of terms they share. The reader is referred to [13] for a
general review of such standard term-based techniques for comparing
textual passages.

The html-linking model incorporates a control parameter 
that sets the minimum similarity parts must exhibit to align 
and contribute to the part-linking pattern. 
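The label-then-term-overlap rule above can be sketched as follows. The text does not pin down the denominator precisely, so this sketch assumes it is the number of distinct terms appearing in either part (a Jaccard-style ratio); the function name part_similarity and the (terms, label) tuple layout are also assumptions made for illustration.

```python
def part_similarity(part_a, part_b):
    """Similarity in [0, 1] between two (terms, label) parts.

    Assumption: the ratio is |common terms| / |distinct terms in
    either part|; the paper's exact denominator may differ.
    """
    terms_a, label_a = part_a
    terms_b, label_b = part_b
    if label_a != label_b:
        return 0.0  # parts with different labels never match
    set_a, set_b = set(terms_a), set(terms_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty parts share no content
    return len(set_a & set_b) / len(union)


def parts_match(part_a, part_b, m_k):
    """Apply the minimum-similarity control parameter m_k."""
    return part_similarity(part_a, part_b) >= m_k
```

Any other function returning a value in [0, 1] could be substituted for part_similarity without changing the rest of the part-linking process.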

3.2 Basis for HTML Resource Linking 

3.2.1 Features for Match Patterns 

Recall that the emphasis behind part-linking is to "line up" two
resources, "draw" lines between matching parts, and then "analyze"
the resulting patterns, as shown in Figure 1. Html-linking proceeds
to characterize the pattern between any pair of resources in a
real-valued vector. One tunable control parameter and four features
are defined for capturing a part pattern from an input HTML resource
to a specific destination resource. Figure 3 shows some resource
pairs and the match patterns they exhibit.
1. minimum match tolerance, $m = m_k \in [0 \ldots 1]$, controls the
minimum similarity parts must exhibit before being declared as
matching. Referring to Figure 3, parts with a line drawn between them
are parts that match with a similarity greater than this value.

2. comprehension, $c \in [0 \ldots 1]$, determines the coverage of
the query resource. It is the number of parts (as a fraction of the
total parts) from the query-resource that participate in the
established pattern.

3. diversity, $d \in [0 \ldots 1]$, determines the coverage of the
destination resource. It is the number of parts (as a fraction of the
total parts) from the destination-resource that participate in the
established pattern.

4. unity, $u \in [0 \ldots 1]$, determines the unity of the matched
parts in the query-resource. To compute the unity, count the number
of parts in the query-resource between the lowest and highest indexed
matched parts - call this count the pack_len. Then count the number
of matched parts in the query-resource - call this count the
match_len. The value

match_len / pack_len

is the "packing factor", or unity, of the pattern.

5. scatter, $s \in [0 \ldots 1]$, determines the unity of the matched
parts in the destination-resource. As with the unity feature, count
the pack_len and match_len of the parts in the destination-resource.
The value

match_len / pack_len

is the destination packing factor of, or scatter of matching parts
in, the pattern.

Figure 3: Part Match Pattern Examples

The vector $(c, d, u, s)$ captures the cohesion of the locally and
structurally confined content matches spread across two HTML
resources.
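As a worked illustration of the four pattern features, the sketch below computes (c, d, u, s) from a list of matched part-index pairs. The function name pattern_features and its argument layout are hypothetical. Two assumptions are made: pack_len counts the parts from the lowest to the highest matched index inclusively, so that a contiguous run of matches has a packing factor of 1.0, and a pair with no matches at all is given the all-zero vector.

```python
def pattern_features(n_query, n_dest, matches):
    """Return (c, d, u, s) for one resource pair.

    n_query, n_dest: number of parts in the query and destination.
    matches: iterable of (query_index, dest_index) matched parts.
    """
    matches = list(matches)
    if not matches:
        return (0.0, 0.0, 0.0, 0.0)  # assumed convention for no pattern

    query_hits = sorted({q for q, _ in matches})
    dest_hits = sorted({d for _, d in matches})

    def packing(indices):
        # match_len / pack_len, with pack_len counted inclusively.
        pack_len = indices[-1] - indices[0] + 1
        return len(indices) / pack_len

    comprehension = len(query_hits) / n_query
    diversity = len(dest_hits) / n_dest
    unity = packing(query_hits)
    scatter = packing(dest_hits)
    return (comprehension, diversity, unity, scatter)
```

For example, a 10-part query whose parts 1, 2, and 4 match parts 0, 5, and 6 of an 8-part destination gives c = 3/10, d = 3/8, u = 3/4, and s = 3/7.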

3.2.2 Global Resource Features 

The extrinsic HTML tags are used to create a global behavior to track
among resources. The "anchor tendencies" of an HTML resource define
two tunable global content features:

1. fanout, $f \in [0 \ldots 1]$, determines the amount of anchoring a
destination resource contains. First, count the number of terms in
the destination resource - call this count term_total. Then count the
number of terms in the resource that appear within anchor markup tags
- call this count anchor_total. The value

anchor_total / term_total

is the anchoring, or fanout, of the destination resource.

2. liveliness, $l \in [0 \ldots 1]$, determines the amount of
multimedia anchoring a destination resource contains. First, count
the number of anchors in the destination resource - call this count
anchors. Then count the number of anchors that point to either image,
sound, or movie resources (i.e., multimedia resources) - call this
count the action_anchors. The value

action_anchors / anchors

is the liveliness of the destination resource.

The vector $(f, l)$ captures a global anchoring behavior exhibited by
a resource. The behavior creates a spectrum for web resources
spanning from those relatively "content self-contained" to those
relatively "content referential". Of those resources exhibiting
referential tendencies, a spectrum is established between those
resources incorporating high degrees of multimedia to those referring
to inanimate external resources. A search for similar resources can
constrain this vector to the specific hypermedia behaviors to be
discovered. As identified by the AltaVista WWW search engine [3],
querying the multimedia aspects of web resources is an attractive
feature that users frequently request and employ when available.
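The two global features can be sketched with the standard library's html.parser module. The following is a minimal illustration, not the original implementation: the class AnchorStats, the function global_features, whitespace tokenization, and in particular the MEDIA_EXTENSIONS list used to decide whether an anchor points to an image, sound, or movie resource are all assumptions.

```python
from html.parser import HTMLParser

# Assumed file extensions that mark a link target as multimedia.
MEDIA_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png",
                    ".au", ".wav", ".mpg", ".mpeg", ".mov", ".avi")


class AnchorStats(HTMLParser):
    """Counts terms, anchored terms, anchors, and multimedia anchors."""

    def __init__(self):
        super().__init__()
        self.term_total = 0
        self.anchor_total = 0
        self.anchors = 0
        self.action_anchors = 0
        self.in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            return  # named anchors without a target are not links
        self.anchors += 1
        self.in_anchor = True
        target = href.lower().split("?")[0]
        if target.endswith(MEDIA_EXTENSIONS):
            self.action_anchors += 1

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False

    def handle_data(self, data):
        count = len(data.split())
        self.term_total += count
        if self.in_anchor:
            self.anchor_total += count


def global_features(html_text):
    """Return (fanout, liveliness) for one HTML resource."""
    stats = AnchorStats()
    stats.feed(html_text)
    stats.close()
    fanout = stats.anchor_total / stats.term_total if stats.term_total else 0.0
    liveliness = stats.action_anchors / stats.anchors if stats.anchors else 0.0
    return (fanout, liveliness)
```

A resource with no terms or no anchors is given a value of 0.0 for the corresponding feature, which is a convention chosen here to avoid division by zero.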

3.2.3 Summary: HTML Association Contexts

The part-linking correspondence between HTML resources is captured in
a real-valued association vector. The association from resource $r_s$
to resource $r_d$ is captured as:

$$A_{sd} = (c_{sd}, d_{sd}, u_{sd}, s_{sd}, f_{sd}, l_{sd})$$

    $c_{sd}$    comprehension of $r_s$    $\in [0 \ldots 1]$
    $d_{sd}$    diversity of $r_d$        $\in [0 \ldots 1]$
    $u_{sd}$    unity of $r_s$            $\in [0 \ldots 1]$
    $s_{sd}$    scatter of $r_d$          $\in [0 \ldots 1]$
    $f_{sd}$    fanout of $r_d$           $\in [0 \ldots 1]$
    $l_{sd}$    liveliness of $r_d$       $\in [0 \ldots 1]$

where $m_k \in [0 \ldots 1]$ specifies the minimum similarity
required for two parts to be considered a match in the association of
$r_s$ with $r_d$. Essentially, the context vector between two
resources can be viewed as expressing the type of association between
the resources. Therefore, pairs of resources that induce the same (or
similar) context vector are pairs that can be considered as
"associated in the same type."

3.3 Neighborhood Search Model 

Given a known resource $r_s$, a target vector $A$ and a match
threshold $m_k$ define an exact search for

all resources $r_d$ such that $\mathrm{assoc}(r_s, r_d, m_k) = A$

The neighborhood search model is introduced to resolve the following
problem with the exact model: if

$\mathrm{assoc}(r_s, r_d, m_k) = A_{sd}$ and $A_{sd} \neq A$

then can $A_{sd}$ be sufficiently close to $A$ to still qualify $r_d$
as a positive match? A window, or neighborhood, around $A$ is
introduced for identifying this style of approximately matching
associations. Assume the existence of a window size
$w_k \in [0 \ldots 1]$, an association context $A$, an origin $r_s$,
and a resource $r_d$ such that
$\mathrm{assoc}(r_s, r_d, m_k) = A_{sd}$. Then, $r_d$ is a positive
association to $r_s$ in window $w_k$ if:

$$\mathrm{window}(w_k, A_{sd}, A) = \text{true}$$

The window function operates on two association vectors, $A_{ij}$ and
$A_{xy}$, and returns true if $A_{ij}$ is in the $w_k$ sized window
around $A_{xy}$:

$$\begin{aligned}
&\text{if } A_{ij} = (c_{ij}, d_{ij}, u_{ij}, s_{ij}, f_{ij}, l_{ij}) \text{ and } A_{xy} = (c_{xy}, d_{xy}, u_{xy}, s_{xy}, f_{xy}, l_{xy}) \\
&\text{then } \mathrm{window}(w_k, A_{ij}, A_{xy}) = \text{true if} \\
&\qquad c_{ij} \in [c_{xy} - w_k/2 \ldots c_{xy} + w_k/2] \text{ and} \\
&\qquad d_{ij} \in [d_{xy} - w_k/2 \ldots d_{xy} + w_k/2] \text{ and} \\
&\qquad u_{ij} \in [u_{xy} - w_k/2 \ldots u_{xy} + w_k/2] \text{ and} \\
&\qquad s_{ij} \in [s_{xy} - w_k/2 \ldots s_{xy} + w_k/2] \text{ and} \\
&\qquad f_{ij} \in [f_{xy} - w_k/2 \ldots f_{xy} + w_k/2] \text{ and} \\
&\qquad l_{ij} \in [l_{xy} - w_k/2 \ldots l_{xy} + w_k/2]
\end{aligned}$$

In other words, resource $r_d$ is in the $w_k$ window of association
to resource $r_s$ if all elements in $A_{sd}$ are within $w_k/2$ of
the corresponding elements in target vector $A$.
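The window test reduces to a per-element comparison against half the window size. A minimal Python sketch follows; representing association vectors as plain tuples of six floats is an assumption made for illustration.

```python
def window(w_k, a_ij, a_xy):
    """True if every element of a_ij lies within w_k / 2 of a_xy.

    a_ij, a_xy: association vectors (c, d, u, s, f, l) as sequences
    of floats in [0, 1]; w_k is the window size in [0, 1].
    """
    if len(a_ij) != len(a_xy):
        raise ValueError("association vectors must have equal length")
    half = w_k / 2
    return all(abs(x - y) <= half for x, y in zip(a_ij, a_xy))
```

With w_k = 0 the test collapses to the exact match of the original model, and with w_k = 1 any element may differ from the target by up to 0.5.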

Each request, in the neighborhood search process, begins with an
originating source $r_s$ and retrieves a neighborhood of associated
destination sources $R_d = \{r_d\}$. In other words, the search
request "html-linking($r_s$, $A$, $m_k$, $w_k$)" is submitted for
matching in the information network. The following dichotomy exists
between the inputs to the html-linking request:

• Inputs $r_s$ and $A$ determine the semantics, or style of
association, to search for - $r_s$ is the example resource upon which
to base the association and $A$ is the pattern, or link type, with
respect to $r_s$ to retrieve. As a result, changing any part of $r_s$
or $A$ will change the underlying meaning of the request.

• Inputs $m_k$ and $w_k$ determine the neighborhood and quality of
the search. Changing either $m_k$ or $w_k$ will not change the
meaning of the result set - it will just change the size and
relevance of the items qualified as matching the context
($r_s$, $A$).

Decreasing the value of $m_k$ or increasing the value of $w_k$ will
increase the size of $R_d$. We would like $m_k$ to get as small as
possible without allowing irrelevant parts to positively match during
the association computation. In addition, we would like $w_k$ to get
as big as possible without allowing irrelevant patterns to positively
match during the association computation. This transition from
relevant to irrelevant relaxations, and the $m_k$ and $w_k$ values it
entails, is demonstrated in the next section.
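The request html-linking(r_s, A, m_k, w_k) can be sketched as a linear scan over candidate resources. This is an illustration only: the function name html_linking, the network argument (any iterable of resource identifiers), the caller-supplied assoc callable, and the decision to skip r_s itself are assumptions, and a real service would presumably consult an index rather than scan every resource.

```python
def html_linking(r_s, target, m_k, w_k, network, assoc):
    """Return the resources r_d whose association with r_s falls in
    the w_k window around the target vector.

    network: iterable of candidate resource identifiers (assumed).
    assoc:   callable (r_s, r_d, m_k) -> association vector (assumed).
    """
    half = w_k / 2
    result = []
    for r_d in network:
        if r_d == r_s:
            continue  # assumed: a resource is not linked to itself
        a_sd = assoc(r_s, r_d, m_k)
        # Window test: every element within w_k / 2 of the target.
        if all(abs(x - t) <= half for x, t in zip(a_sd, target)):
            result.append(r_d)
    return result
```

Because m_k is passed through to assoc and w_k only affects the window test, the two relaxation controls stay separate from the inputs (r_s, A) that define the meaning of the request.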

4 Experimental Observations
4.1 Data Sets

To experiment with the behavior of html-linking, five data sets were
gathered from live WWW sources, as shown in Table 1. The last row,
set $R_{union}$, represents the union of all five sets. The resources
gathered from these sources were inserted into an html-linking index
[11] and the experiments described in the next sections were
performed.



4.2 Minimum Content Match 

This experiment observed the behavior of the minimum content match
control (parameter $m = m_k \in [0 \ldots 1]$) in the html-linking
model. In the experiments, all pairs of resources in the data sets
were examined, one pair at a time, under varying thresholds for $m$.
At each value $m = m_k$, it was computed what percentage of the
possible resource pairs exhibited at least one part matching above
threshold $m_k$. Figure 4 shows the graphs for resource sets
$R_{union}$ and $R_{espn}$; the graphs for the other 4 data sets
showed similar behavior.
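The measurement described above can be sketched as a threshold sweep. The sketch assumes that the best (highest) part similarity has already been computed for every resource pair; the function name matchable_fraction and the use of "greater than or equal" for the threshold test are assumptions.

```python
def matchable_fraction(best_pair_similarity, thresholds):
    """Fraction of resource pairs with at least one matching part.

    best_pair_similarity: for each resource pair, the highest part
    similarity found between the two resources (assumed precomputed).
    thresholds: the m_k values to sweep.
    """
    total = len(best_pair_similarity)
    if total == 0:
        return {m_k: 0.0 for m_k in thresholds}
    return {
        m_k: sum(1 for s in best_pair_similarity if s >= m_k) / total
        for m_k in thresholds
    }
```

A pair has at least one part matching at threshold m_k exactly when its best part similarity reaches m_k, so one pass over the precomputed maxima per threshold is enough to produce a curve of the kind the experiment plots.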




[Figure 4 plots not reproduced. Horizontal axis: $m_k$ ranging from
0.05 to 1.0. Panel (b): the behavior of $m$ when matching parts from
resources in set $R_{espn}$.]

Figure 4: Content Match Plots

The goal of this experiment was to identify if, and where, decreasing
the $m$ parameter tends to introduce irrelevant content matches. The
"point of irrelevance" is defined to be the point in Figure 4 where
the percentage of resources with "matchable" parts demonstrates a
noticeable spike. The intuition is that such a spike represents the
point where the content is matching with such a minimal required
threshold that all parts begin to look like matching chunks of text.
The interesting fact is that the spike for parameter $m$ occurred at
nearly the same value in both the union and espn data sets (as it
also did in the nando, ucla, stan, and cnn data sets that are not
shown here).

    When parameter $m \approx 0.4$, an experimentally observed
    break in relaxation occurs where it is expected that matching
    content ceases to hold relevance to the intended match context.

Therefore a range of [0.4 ... 1.0] can be used as the experimentally
observed "useful" range of values for control parameter $m$.

Name          Domain                                Net Location            # of Resources Gathered
$R_{cnn}$     CNN online news                       www.cnn.com             4303
$R_{espn}$    ESPN online sports server             espnet.sportszone.com   2918
$R_{nando}$   Nando integrated news/sports server   www.nando.net           5845
$R_{ucla}$    UCLA computer science dept.           www.cs.ucla.edu         1923
$R_{stan}$    Stanford computer science dept.       www.cs.stanford.edu     2349
$R_{union}$                                                                 17338

Table 1: Case Study Resource Gatherings

4.3 Maximum Pattern Window 

This experiment observed the behavior of the maximum pattern window
control (parameter $w = w_k \in [0 \ldots 1]$) in the html-linking
model. In the experiments, all pairs of resources with at least one
part matching above $m = 0.4$ were examined, one pair at a time,
under varying thresholds for $w$. At each value $w = w_k$, we
computed the average number of resources (as a percentage of all
possible resource pairs) exhibiting patterns that match in a
relaxation window of size $w_k$. Figure 5 shows the graphs for
resource sets $R_{union}$ and $R_{espn}$; the graphs for the other 4
data sets showed similar behavior.




[Figure 5 plots not reproduced. Horizontal axis: $w$ ranging from 0.0
to 1.0 in 0.02 increments. Panel (a): the behavior of $w$ when
matching resources in set $R_{union}$. Panel (b): the behavior of $w$
when matching resources in set $R_{espn}$.]

Figure 5: Pattern Window Plots

The goal of this experiment was to identify if, and where, increasing
the $w$ parameter tends to introduce irrelevant resource patterns
into a match context. The "point of irrelevance" is defined to be the
point in Figure 5 where the number of resources with "matchable"
parts demonstrates a noticeable spike. The intuition is that such a
spike represents the point where the match context is so relaxed that
all resources tend to match the desired context. The interesting fact
is that the spike for parameter $w$ occurred at nearly the same value
in both the union and espn data sets (as it also did in the nando,
ucla, stan, and cnn data sets that are not shown here).

    When parameter $w \approx 0.3$, an experimentally observed
    break in relaxation occurs where it is expected that matching
    patterns cease to hold relevance to the intended match context.

Therefore a range of [0.0 ... 0.3] can be used as the experimentally
observed "useful" range of values for control parameter $w$.

4.4 Experimental Summary 

This section presented a "snapshot" of the type of behavior observed
when matching html-linking association contexts under varying content
and window relaxations. In the case of content relaxations, it was
seen that a range of [0.4 ... 1.0] for the $m$ control parameter
tended to yield useful results over various samples of resources. In
the case of window relaxations, it was seen that a range of
[0.0 ... 0.3] for the $w$ control parameter tended to yield useful
results over various samples of resources. These experiments give a
flavor for the general observed behavior of html-linking. In [11]
these observations are coupled with an analytical model of the
effects on the association vector $A$ under varying relaxations. The
coupling results in an integrated control model, for adjusting the
$m$ and $w$ parameters, that predicts the quality of the results
expected in the html-linking result sets. The intent in this paper is
to introduce html-linking and demonstrate that it tends to show
correlated behavior when evaluated over diverse resource collections.

5 Summary and Future Work 

In this paper we introduced the html-linking model for computing the
content association between two HTML resources. The association is
based on the following three criteria:

1. Decomposing resources into coherent units, or passages.

2. Finding "content matching" passages between two resources.

3. Computing the "structural match pattern" exhibited by the content
matching passages between two resources. This structural match
pattern describes the type of association the resources possess.

Our ongoing work involves expanding the methods used in the first and
third criteria listed above, as summarized below.

5.1 Resource Decomposition

Our current resource decomposition strategy is dependent 
on the existence of HTML tags in the resources. These tags 
are interpreted as structural markers and a resource is de- 
composed into passages, or parts, based on the placement 
of these tags throughout the document. This decomposi- 
tion strategy has worked well for decomposing HTML doc- 
uments encountered on the web, but is an area for much 
improvement in our implementation. There are two inter- 
esting extensions to this decomposition strategy that we are 
investigating. 

First, the semantic consistency of the parts we extract can be
solidified. In work such as that done in [9], techniques are
developed for segmenting textual streams into semantically consistent
passages. That is, controlled natural language understanding
techniques are coupled with structural organization cues to create
"semantic segmentations" of textual objects into consistent passages.
The segmentation is termed "semantic" because it attempts to keep
sentences in the same passages when they are discussing the same
theme. Therefore, the segmentation mediates between two decomposition
criteria: (1) the semantic, or theme-driven, delineation of text; and
(2) the actual decomposition of the text into sentences, paragraphs,
and other structural units. Our current decomposition uses only the
second criterion (as identified by the HTML organization tags). It is
anticipated that using theme consistency to aid the decision to
segregate or aggregate text will lead to more coherent and consistent
passage decompositions.

Second, our focus has been on strictly the textual aspects 
of documents encountered on the web. Yet, the web is a di- 
verse multimedia information space and documents are often 
aggregations of text, images, video, and various executable 
contents. Currently we are ignoring these non-textual com- 
ponents in our document decomposition and analysis. Our 
future work involves extending the passage decomposition 
of HTML resources to include features and other aspects of 
the non-textual elements interspersed in the documents. 

5.2 Match Patterns 

In this paper, we presented a 6-dimensional feature vector used to
capture the association between resources. We are actively exploring
various classes of additional features that can be used to describe
and match the part pattern exhibited between resources in large and
unconstrained collections [11]. These "new features" fall into two
classes. First, there are the features that build from our existing
decomposition strategy and simply add additional match pattern
descriptors. In [2], for example, a similar part-linking approach is
used where the number of cross-links exhibited by the lines between
matching parts (refer back to Figure 1) is counted and used to
extract resource similarities. Second, there are the features that
manifest from the more advanced decompositions discussed above. In
the example of using themes to semantically drive the text
decomposition, it is the case that the semantic distance between
contiguous passages is not always the same. Thus, when we are
aligning two resources, as in Figure 1, it may be the case that the
parts can be placed on a linear scale representing their relative
semantic similarity to one another. This scale could provide
additional information when computing the cohesiveness of matching
parts between two documents. In the example of incorporating
descriptors of images and other non-textual entities in the
documents, the association feature space would need to be extended to
properly match, align, and describe the association between these
elements of the document decomposition. In either case, it is clear
that our current 6-dimensional vector has provided promising results
but can be expanded to incorporate additional general and
domain-specific features of the association between two documents.

5.3 Conclusion 

The part-linking basis, upon which html-linking is built, identifies
a promising technique for comparing resources based on the content
and structural associations exhibited among their parts. We have
developed and demonstrated html-linking as an external information
service that adds value to sets of information collections found on
the web. Html-linking can be used as the resource comparison method
at the center of both link discovery and link annotation services.
The content locality and organizational structure sensitivity of
part-linking make it a promising technique to further explore as a
vehicle for identifying semantic resource associations on the web.